docs: parquet-format + socket-analysis guides (re-target to main) #41

Merged
randomizedcoder merged 6 commits into main from docs/parquet-format
Jun 20, 2026

@randomizedcoder

Owner

Adds docs/parquet-format.md — a consumer-facing guide to the S3/Parquet export, written for an enterprise data/analytics team that has only a basic grasp of TCP.

Stacked on #39 (docs/protobuf-formats) so the cross-link to that doc resolves. Merge #39 first; this then retargets to main.

What it covers

  • File layout — Hive partitioning host=/date=/hour= (UTC), object naming, how engines expose the partitions.
  • Size / cadence / compression — ~63 MiB uncompressed soft cap (-s3ParquetFlushBytes), per-column ZSTD (strings/bytes) + SNAPPY (numerics).
  • Reading it — DuckDB, pandas/pyarrow, Trino/Athena snippets with partition pruning and column projection.
  • The grain — one row per socket per poll; counters are cumulative; track a connection via inet_diag_msg_socket_cookie.
  • "Start here" columns — the high-value subset with units (RTT µs, snd_cwnd packets, delivery_rate bytes/s, total_retrans, byte counters, congestion algorithm) so the team knows where to focus first.
  • Decoding cheat sheet — raw-byte IPs via inet_diag_msg_family, the TCP state integer→name map, congestion enum, timestamp_ns.
  • Full schema grouping + types, proto3 no-NULL/zero-default gotchas, and where the schema is defined (ParquetRow + the drift test that keeps Parquet/proto/ClickHouse in lockstep).
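
Because counters are cumulative and each row is one socket at one poll, per-interval figures have to be derived by differencing consecutive polls of the same connection. A minimal sketch of that step, assuming rows are already loaded as dictionaries: `inet_diag_msg_socket_cookie` and `timestamp_ns` come from the doc, while `total_retrans` as the exact column name, the helper `per_poll_deltas`, and the sample values are illustrative assumptions.

```python
from collections import defaultdict


def per_poll_deltas(rows, counter="total_retrans"):
    """Turn a cumulative counter into per-poll increments, per connection.

    `counter` is an assumed column name; check the real schema before use.
    """
    # Group polls by socket cookie so each connection is differenced on its own.
    by_cookie = defaultdict(list)
    for row in rows:
        by_cookie[row["inet_diag_msg_socket_cookie"]].append(row)

    deltas = []
    for cookie, polls in by_cookie.items():
        # Order each connection's polls by time before differencing.
        polls.sort(key=lambda r: r["timestamp_ns"])
        for prev, cur in zip(polls, polls[1:]):
            deltas.append({
                "inet_diag_msg_socket_cookie": cookie,
                "timestamp_ns": cur["timestamp_ns"],
                "delta": cur[counter] - prev[counter],
            })
    return deltas


# Made-up sample data: cookie 7 is polled three times, cookie 9 once.
rows = [
    {"inet_diag_msg_socket_cookie": 7, "timestamp_ns": 1_000, "total_retrans": 2},
    {"inet_diag_msg_socket_cookie": 9, "timestamp_ns": 1_000, "total_retrans": 0},
    {"inet_diag_msg_socket_cookie": 7, "timestamp_ns": 2_000, "total_retrans": 5},
    {"inet_diag_msg_socket_cookie": 7, "timestamp_ns": 3_000, "total_retrans": 5},
]
deltas = per_poll_deltas(rows)
```

A socket seen in only one poll yields no delta, which is one reason short-lived connections are easy to undercount. Summing the raw cumulative column across polls would count the same retransmits repeatedly.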

Cross-linked from the docs hub and output-and-destinations.md (S3 section).
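
The decoding steps from the cheat sheet can be sketched roughly as follows. The address-family constants and the state names are the standard Linux values; how the address bytes are laid out in the Parquet column (4 bytes versus a 16-byte buffer for IPv4) is an assumption here, and `decode_ip` is a hypothetical helper, not something the doc defines.

```python
import ipaddress

# Linux address-family values as carried in inet_diag_msg_family.
AF_INET = 2
AF_INET6 = 10

# Linux TCP state numbers (include/net/tcp_states.h).
TCP_STATES = {
    1: "ESTABLISHED",
    2: "SYN_SENT",
    3: "SYN_RECV",
    4: "FIN_WAIT1",
    5: "FIN_WAIT2",
    6: "TIME_WAIT",
    7: "CLOSE",
    8: "CLOSE_WAIT",
    9: "LAST_ACK",
    10: "LISTEN",
    11: "CLOSING",
    12: "NEW_SYN_RECV",
}


def decode_ip(raw: bytes, family: int) -> str:
    """Render a raw-byte address column as text, choosing v4/v6 by family.

    Assumption: an IPv4 address occupies the first 4 bytes of the value,
    whether the column stores 4 bytes or a zero-padded 16-byte buffer.
    """
    if family == AF_INET:
        return str(ipaddress.IPv4Address(raw[:4]))
    if family == AF_INET6:
        return str(ipaddress.IPv6Address(raw[:16]))
    raise ValueError(f"unexpected address family: {family}")


v4 = decode_ip(bytes([192, 0, 2, 1]), AF_INET)
v6 = decode_ip(bytes(15) + b"\x01", AF_INET6)
state = TCP_STATES.get(10, "UNKNOWN")
```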

Notes

  • Grounded in the actual code: ParquetRow (destinations_s3parquet_schema.go), the objectKey layout, and the 63 MiB flush cap.
  • Verified: all relative links (../pkg/…, ../proto/…, sibling docs) resolve; no broken intra-doc anchors.
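
For readers who list objects directly instead of letting an engine expose the partitions, the `host=/date=/hour=` layout can be parsed from the key itself. This is a sketch under assumptions: the example key, the `YYYY-MM-DD` date format, and the file name are hypothetical, and the real `objectKey` layout in the code is authoritative.

```python
from datetime import datetime, timezone


def parse_partitions(object_key: str) -> dict:
    """Extract Hive-style name=value segments from an object key."""
    parts = {}
    for segment in object_key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain '=' are partitions
            parts[name] = value
    return parts


def partition_start(parts: dict) -> datetime:
    """Start of the partition's hour, in UTC (assumes date is YYYY-MM-DD)."""
    day = datetime.strptime(parts["date"], "%Y-%m-%d")
    return day.replace(hour=int(parts["hour"]), tzinfo=timezone.utc)


# Hypothetical key; the host value and file name are made up for illustration.
key = "host=edge-01/date=2026-06-19/hour=20/example.parquet"
parts = parse_partitions(key)
start = partition_start(parts)
```

Partition values arrive as strings, so the hour needs an explicit integer conversion before any time arithmetic.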

🤖 Generated with Claude Code

randomizedcoder and others added 5 commits June 19, 2026 20:28
New docs/parquet-format.md explains the S3/Parquet export for an enterprise
data/analytics audience consuming xtcp2's TCP telemetry:
- Hive partition layout (host=/date=/hour=, UTC) and object naming
- file size/cadence (~63 MiB uncompressed soft cap) and per-column
  compression (ZSTD strings/bytes, SNAPPY numerics)
- how to read it (DuckDB/pandas/Trino) with partition pruning
- the grain (one row per socket per poll; cumulative counters; socket cookie)
- a 'start here' set of the key TCP columns with units (rtt µs, cwnd packets,
  delivery_rate bytes/s, total_retrans, byte counters, congestion algo)
- decoding cheat sheet (raw-byte IPs via family, TCP state map, enums, ts)
- full schema grouping + types, proto3 no-null gotchas, and where the schema
  is defined (ParquetRow + drift test).

Cross-linked from the docs hub and output-and-destinations (S3 section).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Short, name-matched COPY INTO recipe (file format + stage + INFER_SCHEMA
auto-create + MATCH_BY_COLUMN_NAME), an external-table/AUTO_REFRESH note for
continuous ingest, and the two Snowflake gotchas: path-based Hive partitions
(derive from metadata$filename) and BINARY address columns.
New docs/socket-analysis.md: a methodology guide for finding the natural RTT
bands statistically (min_rtt on a log scale; GMM+BIC for adaptive, drift-aware
bands; Jenks/KDE simple alternative; Snowflake quantile quick-win), with
labeling/validation against dest ASN/geo and per-DC/over-time tracking. Adds
multi-feature clustering (HDBSCAN) and other groupings (throughput, loss,
congestion algo, per-ASN, diurnal), a worked SQL→Python example, and a
pitfalls section (per-socket grain, cumulative counters, µs units, survivorship,
app-limited throughput, drift). Cross-linked from the docs hub and parquet doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs: socket-analysis guide — RTT bands & clustering for data teams
@randomizedcoder randomizedcoder changed the title from "docs: Parquet format reference for data teams" to "docs: parquet-format + socket-analysis guides (re-target to main)" Jun 20, 2026
@randomizedcoder randomizedcoder changed the base branch from docs/protobuf-formats to main June 20, 2026 20:58
@randomizedcoder

Owner Author

Re-targeted to main. This branch already contains both docs/parquet-format.md and docs/socket-analysis.md (PR #42 was merged into this branch), plus the Snowflake section and the exact column count. Merging this lands both docs on main; they had not reached main earlier because #41/#42 were stacked on branches that merged first.

@randomizedcoder randomizedcoder merged commit 69a94f2 into main Jun 20, 2026